DRFx: An Understandable, High Performance, and Flexible Memory Model for Concurrent Languages
execution proceeds. We write (M, I) −→T (M̂, Î) to mean that executing instruction I in machine state M results in machine state M̂ with Î poised to execute next in thread T. We write (M, I) −→*T (M̂, Î) to indicate several steps of execution (the transitive closure of the above). Fence instructions behave as no-ops: (M, HFENCE) −→T (M, I), where I is the next instruction in program order in T, and similarly for SFENCE.

We extend the notion of a thread's abstract execution to a program by having execution proceed by choosing any thread and executing a single instruction from that thread. We write (M, {I1, . . . , Ij, . . . , In}) −→P (M̂, {I1, . . . , Îj, . . . , In}) if (M, Ij) −→Tj (M̂, Îj). We call one or more of these steps a (partial) abstract sequential execution: (M, {I1, . . . , In}) −→*P (M̂, {Î1, . . . , În}).

We define a behavior to be a pair of machine states and denote it by Mstart ⇝ Mend. Intuitively, we use behaviors to describe a starting machine state and a machine state that is arrived at after executing some or all of a program. The standard notion of sequential consistency can be phrased in terms of behaviors and abstract sequential executions.

Definition 1. M0 ⇝ M is a sequentially consistent behavior for a program P, or M0 ⇝ M is SC for P, if there exists an abstract sequential execution (M0, {I1⁰, . . . , In⁰}) −→*P (M, {END, . . . , END}) where each Ii⁰ is the first instruction in thread Ti. We say that M0 ⇝ M is a sequentially consistent partial behavior for P if there is a partial abstract sequential execution (M0, {I1⁰, . . . , In⁰}) −→*P (M, {I1, . . . , In}) where each Ii⁰ is the first instruction in thread Ti.

We say that two memory access instructions u and v conflict if they access the same memory location, at least one is a write, and at least one is a regular access. We say that a program has a data race if it has a partial abstract sequential execution where two conflicting accesses are ready to execute. More formally:

Definition 2. A program P has a data race if for some M0, u, v, (M0, {I1⁰, . . . , In⁰}) −→*P (M, {I1, . . . , u, . . . , v, . . . , In}), where each Ii⁰ is the first instruction in thread Ti and u and v are conflicting accesses. We shall say that such a partial abstract sequential execution exhibits a data race.

This canonical, formal definition captures the notion of a simultaneous data race in an abstract sequential execution. As discussed in Section 2.4, optimized, racy programs can yield results that are not sequentially consistent even on executions where a strict simultaneous data race does not occur. Such executions exhibit a happened-before data race, where conflicting accesses are not ordered by any atomic synchronization operations. In several of our proofs, we look for a happened-before data race in an execution and then rearrange the execution trace to generate a valid execution of the program that exhibits a simultaneous data race.
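As a concrete illustration of Definition 2 (our own example, not one taken from the paper), the following C++ fragment contains two conflicting accesses to the regular location data: the write in writer and the read in reader_racy are not ordered by any atomic synchronization operation, so there is a partial abstract sequential execution in which both are ready to execute. The reader_race_free variant orders the same accesses through the atomic flag ready and is therefore race-free.

#include <atomic>
#include <cstdio>
#include <thread>

int data = 0;                      // regular (non-atomic) memory location
std::atomic<bool> ready{false};    // atomic location used for synchronization

void writer() {
    data = 42;                     // regular write to data
    ready.store(true, std::memory_order_release);
}

void reader_racy() {
    // Conflicting regular read: nothing orders it with the write above,
    // so both accesses can be ready to execute simultaneously (a data race).
    std::printf("%d\n", data);
}

void reader_race_free() {
    // The acquire load synchronizes with the release store, ordering the
    // accesses to data by happened-before; this variant is race-free.
    while (!ready.load(std::memory_order_acquire)) { }
    std::printf("%d\n", data);
}

int main() {
    std::thread t1(writer), t2(reader_race_free);   // substituting reader_racy
    t1.join();                                      // here yields the racy program
    t2.join();
}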
3.2. DRFx-Compliant Compilation

As described informally in Section 2, one of the responsibilities of a DRFx-compliant compiler is to divide a program into code regions that satisfy several requirements. We formally capture these requirements with the notion of a valid thread partition, introduced here. A partition Q of a thread T is a set of disjoint, contiguous subsequences of T that cover T. Call each of these subsequences a region. Regions will be denoted by the metavariable R.

Definition 3. A partition Q is valid if:
—each atomic operation and END operation is in its own region
—each region has a single entry point (i.e., every branch has a target that is either in the same region or is the first instruction in another region)

We extend the notion of abstract execution of a thread from instructions to regions as follows. We write (M, R) −→T (M̂, R̂) if (M, I1) −→T · · · −→T (M̂, In), where
—I1 is the first instruction in R,
—I2, . . . , In−1 ∈ R,
—Ik ̸= I1 for each 2 ≤ k < n, and
—In is the first instruction in region R̂ (it is possible that R̂ = R).

For threads with valid partitions, (M, R) −→T (M̂, R̂) intuitively means that beginning with memory in state M, executing the instructions in R in isolation will result in memory having state M̂ and T ready to execute the first instruction in region R̂. Extending this to programs, an abstract region-sequential execution is one where a scheduler arbitrarily chooses a thread and executes a single region from that thread. We can now formally introduce the notion of region serializability, which will be the key to establishing that an execution of the optimized, compiled program is sequentially consistent with respect to the source program. We define region-serializable behavior for a program P in terms of an abstract region-sequential execution.

Definition 4. We say M0 ⇝ M is a region-serializable behavior, or RS, for P with respect to thread partitions Qi if there is an abstract region-sequential execution (M0, {R1⁰, . . . , Rn⁰}) −→*P (M, {R1, . . . , Rn}), where each Ri⁰ is the first region given by partition Qi for thread Ti.

Now let us introduce notation for the read and write sets for a region given a starting memory state. read(M, R) is the set of locations read when executing R in isolation starting from memory state M. write(M, R) is defined similarly. Note that these are sets and not sequences.

We can now describe the requirements the DRFx model places on a compiler. Consider a compilation P ⇝ P′ where each thread Ti in P is partitioned into some number, mi, of regions by Qi. So we have P = {T1, . . . , Tn} = {R11 · · · R1m1, . . . , Rn1 · · · Rnmn}. Furthermore, the compiled program has the same number of threads and each is partitioned by some Q′i into the same number of regions as in the original program. So we have P′ = {R′11 · · · R′1m1, . . . , R′n1 · · · R′nmn}. We consider such a compilation to be DRFx-compliant if:
(C1) The partitions Qi and Q′i are valid.
(C2) For all i, j, M, we have (M, Rij) −→Ti (M̂, Rik) ⇐⇒ (M, R′ij) −→T′i (M̂, R′ik).
(C3) For all i, j, M, we have read(M, Rij) ⊇ read(M, R′ij) and write(M, Rij) ⊇ write(M, R′ij).
(C4) Each region R′ij in the compiled program contains exactly one fence operation and it is the first instruction. Each of the fences surrounding an atomic operation must be an HFENCE. The fence preceding an END operation also must be an HFENCE.

Intuitively, the above definition of a DRFx-compliant compilation requires that a DRFx-compliant compiler choose valid partitions for a program's threads, perform optimizations only within regions, maintain the read and write sets of each region, and introduce HFENCE and SFENCE instructions to demarcate region boundaries. These fence instructions communicate the thread partitions chosen by a DRFx-compliant compiler to the execution environment. In the next section, we will refer to these as the fence-induced thread partitions of a program.
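To make conditions (C2) and (C3) concrete, here is a small, hypothetical C++ sketch of our own (not taken from the paper). Eliminating a redundant load inside a region preserves the region's behavior and its read and write sets, while turning a conditional store into an unconditional one introduces a read and a write of *q that some executions of the source region never perform, violating (C3) and risking a false conflict.

// Source region: t = *p; u = *p; if (c) *q = 1; return t + u;
int source_region(const int* p, int* q, bool c) {
    int t = *p;
    int u = *p;            // second load of *p
    if (c) *q = 1;         // store to *q happens only when c is true
    return t + u;
}

// Allowed by (C2)/(C3): the redundant load is forwarded; for every starting
// memory state the behavior, the read set {p}, and the write set ({q} only
// when c holds) are unchanged.
int region_optimized_ok(const int* p, int* q, bool c) {
    int t = *p;
    int u = t;             // redundant load eliminated
    if (c) *q = 1;
    return t + u;
}

// Not allowed: if-conversion to an unconditional store reads and writes *q
// even when c is false, enlarging the region's read and write sets.
int region_optimized_bad(const int* p, int* q, bool c) {
    int t = *p;
    *q = c ? 1 : *q;       // speculative read and write of *q
    return t + t;
}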
We now state the two key lemmas we have proven for DRFx-compliant compilations.

LEMMA 1. If P ⇝ P′ is a DRFx-compliant compilation and M0 ⇝ M is a region-serializable behavior for P′ with respect to its fence-induced thread partitions, then M0 ⇝ M is a (partial) sequentially consistent behavior for P.

PROOF SKETCH. We can transform an abstract region-sequential execution of P′ to an abstract region-sequential execution of P due to (C2). Clearly an abstract region-sequential execution qualifies as an abstract sequential execution.

LEMMA 2. If P ⇝ P′ is a DRFx-compliant compilation and P′ has a data race, then P has a data race.

PROOF SKETCH. Essentially, we take a partial abstract sequential (but not necessarily region-sequential) execution of P′ that exhibits a simultaneous data race, truncate it to the earliest happened-before data race, and reorder the truncated trace while maintaining program dependencies to achieve a trace of P′ with a region-sequential prefix and a suffix containing a simultaneous race. The ability to perform this reordering and achieve a region-sequential prefix relies critically on (C1), which insists that atomic accesses are in their own region. We can then use (C2) and (C3) to construct an abstract sequential execution of P exhibiting a (possibly different) data race from the racy execution of P′.

Full proofs for the lemmas in this section can be found in Marino et al. [2009b].

3.3. DRFx-Compliant Execution

We now formally specify the requirements that the DRFx model places on a machine executing a program. Note that the requirements for the execution environment are cleanly separated from those for the compiler. All references to a program in this section refer to a compiled, optimized program and not to the original source program.

Unlike compilers, machines do not typically perform optimizations that completely transform the input program. They more or less faithfully execute the instructions given to them, reordering instruction execution and locally speculating in order to avoid expensive stalls. This difference is reflected in our formal description of a relaxed execution, which records traces of instructions from the input program along with memory ordering information. It is also crucial for our formalism to model the data race detection mechanism in order to prove both Data-Race Completeness (we must show that all detections truly indicate a racy program) and SC-Soundness (we must establish that the lack of a detection precludes certain dynamic memory orderings that would violate region serializability). Note that data-race detection is performed for each region, but just like a machine may have multiple instructions in flight and execute them out of order, a machine may also have multiple regions in flight and perform conflict detection for them out of order, subject to certain restrictions.
We will represent a (partial) relaxed execution, E, of a program as a 5-tuple E = (M0, T, EO, RCS, err). Each of the components is described below:

—M0 is the initial machine state.

—T is a set of individual, dynamic thread traces (T = {τ1, . . . , τn}). Each thread trace τi contains all of the instructions executed so far in thread i, starting with the first instruction and continuing in program order without skipping over any instructions. (Note that any out-of-order instruction execution performed by the machine is not captured by these traces.) We call this order TO; it totally orders the dynamic instructions executed within a thread and is a partial order on all instructions in the program execution. Each thread trace is divided into dynamic regions (notated using the metavariable ρ), with all instructions between two fence instructions in the trace belonging to one dynamic region. This is referred to as the fence-induced partition. We call the fence-induced partition valid if all atomic operations are immediately surrounded by HFENCE instructions. Although, strictly speaking, TO is a relation on instructions, we will also use it to order dynamic regions within a thread trace.

—EO is a relation that specifies a partial order on memory accesses. If two operations u and v access the same memory location and at least one of them is a write, then either u <EO v or v <EO u. Furthermore, two operations that do not access the same memory location are not related by EO. EO uniquely defines the write whose value each read sees (i.e., the most recent write to the same location in EO). Note that EO ∪ TO may contain cycles, so the relaxed orderings allowed by optimizations such as out-of-order execution and store buffers are captured by EO rather than by the thread traces.

—RCS is a map from dynamic regions to a conflict detection state in the set {uncommitted, lagging, committed}. Intuitively, RCS models a conflict detection mechanism that works on the fence-demarcated regions and moves them through three states as they execute, from uncommitted, possibly to lagging, and, finally, to committed when detection successfully completes with no region conflict found. A conflict detection mechanism may commit SFENCE-bounded regions out of program order, and a lagging region is one that has not yet committed but for which a later region has committed.

—err is either ∅ or a single element of EO, u <EO v. Intuitively, a non-empty err will indicate a conflicting pair of accesses in concurrently executing regions, which triggers an MM exception. An execution that has err = ∅ is called exception-free, while an execution where err ̸= ∅ is called exceptional.
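For concreteness, the following C++ sketch shows one way this 5-tuple could be represented in a reasoning or simulation tool. The type and field names are our own; the paper defines the components abstractly, and later sketches in this section reuse these hypothetical types.

#include <cstddef>
#include <map>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Hypothetical identifiers for dynamic instructions and fence-delimited regions.
using InstrId  = std::size_t;
using RegionId = std::size_t;
using ThreadTrace = std::vector<InstrId>;           // program order (TO) within a thread

enum class RegionState { Uncommitted, Lagging, Committed };   // codomain of RCS

struct MachineState { /* memory and register contents, left abstract */ };

struct RelaxedExecution {                           // E = (M0, T, EO, RCS, err)
    MachineState M0;                                // initial machine state
    std::vector<ThreadTrace> T;                     // one trace per thread
    std::set<std::pair<InstrId, InstrId>> EO;       // (u, v) in EO means u <EO v
    std::map<RegionId, RegionState> RCS;            // region conflict-detection state
    std::optional<std::pair<InstrId, InstrId>> err; // conflicting pair, if an MM exception
};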
We further define a hardware dependence partial order, D, that captures intra-thread data and control dependencies. D is a subset of TO and orders a read before a TO-subsequent memory access if the value returned from the read is used to compute: the address accessed by the subsequent instruction, the value written by the subsequent instruction, or the condition used by an intervening branch.² D also orders writes to a location before TO-subsequent accesses to the same location.

We say that an execution E = (M0, T, EO, RCS, err) is well formed for a program P if all of the following conditions are met:

(WF1) Each thread trace τi represents a valid sequential execution of thread i in P given that each read sees the value written by the (unique) closest preceding write in EO. [Instruction semantics are respected.]

(WF2) EO is consistent with D in the sense that u <D v =⇒ ¬(v <EO u). Furthermore, EO|wr ∪ D must be acyclic, where EO|wr is the subset of EO containing only write-to-read (i.e., read-after-write) dependencies (u <EO|wr v ⇐⇒ u <EO v ∧ u a write ∧ v a read). [Execution order respects intra-thread dependencies, and speculative writes are not visible to other threads.]

(WF3) A committed or lagging region never follows an uncommitted region in a thread trace. That is, if there is some ρ such that RCS(ρ) = uncommitted, then for all ρ′ such that ρ <TO ρ′, RCS(ρ′) = uncommitted. [All uncommitted regions appear consecutively at the end of each thread trace.]

(WF4) A lagging region always has some committed region following it in its thread trace. That is, for all ρ such that RCS(ρ) = lagging, there exists some ρ′ such that ρ <TO ρ′ and RCS(ρ′) = committed. [A lagging region always has a later region that committed out of order.]

(WF5) All regions preceding an HFENCE in a thread trace are committed. No thread trace contains an atomic access without an HFENCE immediately following it. [An HFENCE stalls until conflict detection for all prior regions completes.]

Intuitively, conditions (WF1) and (WF2) simply ensure that our machine correctly executes instructions and obeys intra-thread data and control dependencies. In particular, Condition WF2 prevents a machine from speculatively writing a value and making it visible to other threads before a read on which the write depends completes.³

Conditions (WF3) and (WF4) establish some basic conditions that we assume for a conflict detection mechanism. Multiple uncommitted regions may be in-flight in a thread simultaneously. Regions may commit out of order, but when this happens, prior uncommitted regions in the same thread must be classified as lagging regions. Condition (WF5) establishes that HFENCE instructions force all prior regions to commit. Furthermore, atomic operations may not complete (i.e., become visible to other threads) until their region is committed and the succeeding HFENCE is executed.

² Note that even “artificial” dependencies, where a computation uses a value read from memory in such a way that it does not actually influence the subsequent instruction (for instance, the read value is XOR’d with itself, always resulting in 0), are included in D.

³ It is interesting to note that the conditions for a DRFx-compliant compilation do allow optimizations that introduce a speculative write within a region, as long as they do not change the read and write sets of any abstract execution of the region. Even so, “out-of-thin-air” results due to such optimizations are prevented by either detecting a data race or ensuring region-serializable behavior during execution of the compiled program. By prohibiting the hardware from making speculative writes visible to other threads, we facilitate our proof that if the hardware raises a data-race exception, then the program indeed has a data race. While we could likely relax this prohibition using something similar to the compiler's requirement to maintain read and write sets on all executions, this seems unnecessarily complex given that most hardware architectures already satisfy the condition.
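As an illustration (not part of the paper's formalism), conditions (WF3) and (WF4) reduce to a simple structural check on the sequence of region states along a single thread trace. The sketch below reuses the hypothetical RegionState type from the earlier sketch.

#include <cstddef>
#include <vector>

// 'states' lists the conflict-detection states of one thread's dynamic
// regions in trace (TO) order.
bool wf3_wf4_hold(const std::vector<RegionState>& states) {
    // (WF3): once an uncommitted region appears, every later region is also
    // uncommitted, i.e., uncommitted regions form a suffix of the trace.
    bool seen_uncommitted = false;
    for (RegionState s : states) {
        if (s == RegionState::Uncommitted) seen_uncommitted = true;
        else if (seen_uncommitted) return false;   // committed/lagging after uncommitted
    }
    // (WF4): every lagging region has some committed region after it in TO.
    for (std::size_t i = 0; i < states.size(); ++i) {
        if (states[i] != RegionState::Lagging) continue;
        bool later_committed = false;
        for (std::size_t j = i + 1; j < states.size(); ++j)
            if (states[j] == RegionState::Committed) { later_committed = true; break; }
        if (!later_committed) return false;
    }
    return true;
}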
A well-formed execution has well-defined behavior and a conflict detection state that meets some basic structural conditions, but we have not yet specified what conflicts must be detected in order to establish the DRFx guarantees. To be DRFx compliant, an exception-free execution must exhibit region-serializable behavior, and an exceptional execution must imply a racy program. We have devised a set of conditions on the conflict detection mechanism that suffice to establish compliance. They essentially require a machine to commit a region only after all accesses in the region are complete and globally visible, and only if it can guarantee that the region's accesses either do not conflict with, or are memory-ordered before, any accesses in uncommitted regions on other threads. The conditions allow some flexibility to commit SFENCE-bounded regions out of order but ensure that memory ordering cycles cannot be introduced as a result of the lagging regions. Finally, the conditions require that the detection mechanism reports an exception only if there truly are conflicting accesses on different threads, neither of which is from a region that has already committed.

Formally, we call a well-formed execution E = (M0, T, EO, RCS, err) DRFx compliant if it satisfies all of the following conditions:

(E1) There exists a total order RO on committed and lagging regions that is consistent with EO ∪ TO lifted to regions. [The set of committed and lagging regions has an order consistent with thread and memory ordering.]

(E2) There do not exist a read r ∈ ρr and a write w ∈ ρw such that RCS(ρr) ̸= uncommitted and RCS(ρw) = uncommitted and w <EO r. [Reads in committed and lagging regions do not see writes in uncommitted regions.]

(E3) There do not exist a read r ∈ ρr and a write w ∈ ρw such that RCS(ρr) = uncommitted and RCS(ρw) = committed and r <EO w. [Writes from committed regions are visible to reads in uncommitted regions.]

(E4) If err = u <EO v, then u ∈ ρu and v ∈ ρv conflict, u and v are from different threads, neither ρu nor ρv is committed, and at least one of these regions is uncommitted.

Intuitively, the conditions ensure a conflict detection mechanism in which committed regions are guaranteed not to contain accesses that participate in a race that violates region serializability, while lagging regions are guaranteed not to participate in such a race with accesses in other lagging regions or committed regions but may participate in a violating race with an access in an uncommitted region. Condition (E1) ensures that any race that would cause committed and lagging regions not to be serializable is caught. Condition (E2) requires that all reads in a region must complete before it or any subsequent region commits. Condition (E3) requires that all writes in a region must complete and be visible to other threads before it commits.

We need one more bit of notation to express our key lemmas: an operator on a well-formed, partial relaxed execution that truncates incomplete thread traces to include only committed and lagging regions. Note that Condition WF3 ensures that all uncommitted regions in a thread trace occur consecutively at the end. The truncation operator drops instructions from these uncommitted regions from the end of each trace, removes pairs from EO if at least one operation in the pair has been truncated from its thread trace, removes truncated regions from RCS, and sets err to ∅. We notate this as follows: ⌊(M0, T, EO, RCS, err)⌋ = (M0, ⌊T⌋, ⌊EO⌋, ⌊RCS⌋, ∅).
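The truncation operator has a direct computational reading. The following sketch implements it over the hypothetical RelaxedExecution type introduced earlier, with a region_of map (our own addition) supplying the assignment of dynamic instructions to their fence-delimited regions.

#include <map>
#include <set>

// Truncation operator ⌊E⌋ described above, over the hypothetical types.
RelaxedExecution truncate_execution(RelaxedExecution E,
                                    const std::map<InstrId, RegionId>& region_of) {
    std::set<InstrId> dropped;
    // Drop instructions of uncommitted regions; by (WF3) they form a suffix
    // of each thread trace, so popping from the back suffices.
    for (ThreadTrace& trace : E.T) {
        while (!trace.empty() &&
               E.RCS.at(region_of.at(trace.back())) == RegionState::Uncommitted) {
            dropped.insert(trace.back());
            trace.pop_back();
        }
    }
    // Remove EO pairs that mention a truncated instruction.
    for (auto it = E.EO.begin(); it != E.EO.end(); ) {
        if (dropped.count(it->first) || dropped.count(it->second)) it = E.EO.erase(it);
        else ++it;
    }
    // Remove truncated regions from RCS and set err to the empty value.
    for (auto it = E.RCS.begin(); it != E.RCS.end(); ) {
        if (it->second == RegionState::Uncommitted) it = E.RCS.erase(it);
        else ++it;
    }
    E.err.reset();
    return E;
}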
The following lemmas establish two key results for DRFx-compliant executions. Notice that these lemmas allow us to relate relaxed executions to properties that are defined in terms of abstract, interleaved executions (region-serializable behavior and racy programs). In this way, they can be easily combined with Lemmas 1 and 2, which did not need to reason about relaxed execution ordering.

LEMMA 3. If E is a well-formed, DRFx-compliant execution of a (compiled) program P with valid fence-induced thread partitions, then ⌊E⌋ exhibits region-serializable behavior for P w.r.t. the fence-induced partitions.

PROOF SKETCH. This follows quickly from (E1) and (E2). Condition (E2) establishes that any value read by an instruction in ⌊E⌋ was written by an instruction that is also in ⌊E⌋, and thus the truncated execution is well formed. Furthermore, Condition (E1) establishes an order on the regions in ⌊E⌋ that is consistent with both thread order and the way that EO orders conflicting accesses within those regions. This establishes that the execution is serializable w.r.t. the regions.

LEMMA 4. If there is a well-formed, exceptional, DRFx-compliant execution of a (compiled) program P with valid fence-induced thread partitions, then P has a data race.

PROOF SKETCH. From Lemma 3, we know that the execution has a region-serializable prefix. We then use this to construct an abstract sequential execution of the prefix. Because the execution is exceptional, Condition (E4) guarantees that we have conflicting accesses neither of which is contained in a committed region and at least one of which is from an uncommitted region. We can extend the execution of the prefix to an execution demonstrating a happened-before data race. Essentially, for a program with valid fence-induced thread partitions, a happened-before relation between operations on different threads implies the existence of an HFENCE following one operation on its thread and preceding the other on its thread. Since neither of the conflicting accesses is from a committed region, and Condition (WF5) requires regions preceding an HFENCE to be committed, we know the accesses cannot be related by happened-before. Finally, we derive from this an execution of P that exhibits a simultaneous race.⁴

Full proofs for the previous two lemmas can be found in Singh et al. [2011b]. Rather than starting with the conditions for well-formed, DRFx-compliant execution, the proofs in the cited technical report are done in the context of the particular architectural design described in Section 6. Conditions (E1) through (E4) capture the supporting lemmas from the technical report that are used to establish the results above.⁵

⁴ In fact, there are exceptional, DRFx-compliant executions where the conflict detected is not reachable through an abstract sequential execution, but this can only happen as the result of a previous data race that is reachable.

⁵ Note that an earlier technical report [Marino et al. 2009b] establishes similar results under a different set of conditions that were too restrictive for the eventual hardware design.
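Before turning to the end-to-end guarantees, conditions (E2) and (E3) stated earlier can likewise be phrased as a check over the hypothetical types from the previous sketches; the classification of dynamic instructions into reads and writes is supplied externally here.

#include <map>
#include <set>
#include <utility>

bool e2_e3_hold(const RelaxedExecution& E,
                const std::map<InstrId, RegionId>& region_of,
                const std::set<InstrId>& reads,
                const std::set<InstrId>& writes) {
    for (const auto& [u, v] : E.EO) {                 // u <EO v
        RegionState su = E.RCS.at(region_of.at(u));
        RegionState sv = E.RCS.at(region_of.at(v));
        // (E2): no write w in an uncommitted region with w <EO r for a read r
        //       in a committed or lagging region.
        if (writes.count(u) && reads.count(v) &&
            su == RegionState::Uncommitted && sv != RegionState::Uncommitted)
            return false;
        // (E3): no read r in an uncommitted region with r <EO w for a write w
        //       in a committed region.
        if (reads.count(u) && writes.count(v) &&
            su == RegionState::Uncommitted && sv == RegionState::Committed)
            return false;
    }
    return true;
}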
3.4. DRFx Guarantees

Putting together the lemmas from Sections 3.2 and 3.3, we can prove the following theorem, which ensures that a DRFx-compliant compiler along with a DRFx-compliant execution environment enforce the SC Soundness and Race Completeness properties. We call an execution complete if either it is exceptional (contains a non-null err component) or all the thread traces in the execution terminate in an END operation.

THEOREM 1. If P ⇝ P′ is a DRFx-compliant compilation, and E is a complete DRFx-compliant execution of P′ with behavior M0 ⇝ M, then either:
—E is exception-free and M0 ⇝ M is a sequentially consistent behavior for P, or
—E is exceptional and P contains a data race.

The arguments presented above were developed entirely in the context of a low-level machine language. The results can, however, be extended to a high-level source language in the following way. Imagine a “canonical compiler” that translates each high-level statement into a series of low-level operations that read the operands from memory into registers, perform appropriate arithmetic operations on the registers, and then store results back to memory. Any optimizations are then applied after this canonical compiler is run. We can extend the results to the high-level language simply by requiring that the compiler choose a region partition that does not split up instructions that came from the same high-level source language expression or statement. This argument assumes that the number of memory accesses in the compilation of any statement in the source language is bounded by the maximum region size. If the source language does not guarantee this property, then its compiler can emit a warning in the rare case that a single source statement is forced to span multiple regions, alerting the user that, in the presence of data races, the statement may not execute atomically and an exception may not be thrown. We further discuss this and related issues in Section 8.3.

The definition of a DRFx-compliant execution and Lemma 3 establish that all DRFx-compliant executions are region-serializable up to the latest committed region in each thread. Combining this fact with Lemma 1, we can see that, restricted to committed and lagging regions, a DRFx-compliant execution is SC with respect to the original source program. Note that an HFENCE operation cannot execute until all previous regions in its thread are committed (condition (WF5)). Therefore, requiring that system calls are preceded by HFENCE instructions and only use thread-local data ensures that the behavior they exhibit would have been achievable in an SC execution of the original program. This establishes the Safety property of the DRFx model.⁶

⁶ Condition E2 is also essential in establishing the Safety property since it ensures that no read preceding a system call sees a write from an uncommitted region that might not be part of an SC execution.
4. COMPILER AND HARDWARE IMPLEMENTATION

There are several possible compiler and hardware designs that meet the requirements necessary to ensure the DRFx properties as described in the previous section. In the next two sections, we describe one concrete approach for the DRFx-compliant compiler and hardware. It is evaluated in Section 7. The approach is based on two key ideas crucial for a simple hardware design:

—Bounded regions: First, the compiler bounds the size of each region in terms of the number of memory accesses it can perform dynamically, using a conservative static analysis. Bounding ensures that the hardware can perform conflict detection with fixed-size data structures. Detecting conflicts with unbounded regions in hardware would require complex mechanisms, such as falling back to software on resource overflow, that are likely to be inefficient.

—Soft fences: When splitting regions to guarantee boundedness, the compiler inserts a soft fence. Soft fences are distinguished from the fences used to demarcate synchronization operations and system calls, which are called hard fences. While hard fences are necessary to respect the semantics of synchronization accesses and guarantee the properties of DRFx, soft fences merely convey to the hardware the region boundaries across which the compiler did not optimize. These smaller, soft-fence-delimited regions ensure that the hardware can soundly perform conflict detection with fixed-size resources. But it is in fact safe for the hardware to reorder instructions across soft fences whenever hardware resources are available, essentially erasing any hardware performance penalty due to the use of bounded-size regions.

5. DRFx-COMPLIANT COMPILER

A DRFx-compliant compiler was built by modifying the LLVM compiler [Lattner and Adve 2004]. As specified by the requirements (C1) through (C4) in the previous section, to ensure the DRFx properties the compiler must simply partition the program into valid regions, optimize only within regions, avoid inserting speculative memory accesses, and insert fences at region boundaries.

5.1. Inserting Hard Fences for DRFx Compliance

A hard fence is similar to a traditional fence instruction. The hardware ensures that prior instructions have committed before allowing subsequent instructions to execute, and the compiler is disallowed from optimizing across them. To guarantee SC for race-free programs, the compiler must insert a hard fence before and after each synchronization access. On some architectures, the synchronization access itself can be translated to an instruction that has hard-fence semantics (e.g., the atomic xchg instruction in AMD64 and Intel64 [Boehm and Adve 2008]), obviating the need for additional fence instructions. In the current implementation, the compiler treats all calls to the pthread library and lock-prefixed memory operations as “atomic” accesses. In addition, since the LLVM compiler does not support the atomic keyword proposed in the new C++ standard, all volatile variables are treated as atomic. All other memory operations are treated as data accesses.

To guarantee DRFx's Safety property, a DRFx-compliant compiler should also insert hard fences for each system call invocation, one before entering kernel mode and another after exiting kernel mode. Any state that could be read by the system call should first be copied into a thread-local data structure before the first hard fence is executed. This approach ensures that the external system can observe only portions of the execution state that are reachable in some SC execution. Transforming system calls in this way is not implemented in the compiler used for the experiments in Section 7.
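To show roughly where hard fences end up, here is an illustrative sketch of our own (not the paper's actual compiler output); hfence() is a hypothetical stand-in for the emitted fence, shown around a pthread synchronization access and a system call.

#include <atomic>
#include <pthread.h>
#include <unistd.h>

// Hypothetical marker for the hard fence the compiler emits; on x86 it would
// correspond to an mfence-like instruction.
inline void hfence() { std::atomic_thread_fence(std::memory_order_seq_cst); }

int counter = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void increment() {
    hfence();                    // hard fence before the synchronization access
    pthread_mutex_lock(&m);      // pthread call treated as an "atomic" access
    hfence();                    // hard fence after it

    counter++;                   // regular data accesses, optimizable within a region

    hfence();
    pthread_mutex_unlock(&m);
    hfence();
}

void report() {
    int local = counter;         // copy state the system call reads into thread-local data
    hfence();                    // hard fence before entering the kernel
    write(STDOUT_FILENO, &local, sizeof local);    // system call
    hfence();                    // hard fence after returning from the kernel
}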
To insert a hard fence, the compiler uses the llvm.memory.barrier intrinsic in LLVM with all ordering restrictions enabled. This ensures that the LLVM compiler passes do not reorder memory operations across the fence. LLVM's code generator translates this instruction to an mfence instruction on x86, which restricts hardware optimizations across the fence.

5.2. Inserting Soft Fences to Bound Regions

In addition to hard fences, the compiler inserts soft fences to bound the number of memory operations in any region. Soft fences are inserted using a newly created intrinsic instruction in LLVM that is compiled to a special x86 no-op instruction that can be recognized by the DRFx hardware simulator as a soft fence. The compiler employs a simple and conservative static analysis to bound the number of memory operations in a region. While overly small regions do limit the scope of compiler optimizations, experiments show that the performance loss due to this limitation is about 6.2% on average (Section 7).

After inserting all the hard fences described earlier, the compiler performs function inlining. Soft fences are then inserted in the inlined code. A soft fence is conservatively inserted before each function call and return and before each loop back-edge. Finally, the compiler inserts additional soft fences in a function body as necessary to bound region sizes. The compiler performs a conservative static analysis to ensure that no region contains more than R memory operations, thereby bounding the number of bytes that can be accessed by any region. The constant R is determined based on the size of the hardware buffers provisioned for conflict detection.

The above algorithm prevents compiler optimizations across loop iterations, such as loop-invariant code motion, since a soft fence is inserted at each back-edge. However, it would be possible to apply a transformation similar to loop tiling [Wolfe 1989], which would have the effect of placing a soft fence only once every R/L iterations, where L is the maximum number of memory operations in a single loop iteration. Restructuring loops in this way would allow the compiler to safely perform optimizations across each block of R/L iterations.

5.3. Compiler Optimization

After region boundaries have been determined, the compiler may perform its optimizations. By requirements (C2) and (C3), any sequentially valid optimization is allowed within a region, as long as it does not introduce any speculative reads or writes, since they can cause false conflicts. As such, in the current implementation, all speculative optimizations in LLVM are explicitly disabled.⁷ Note, however, that there are several useful speculative optimizations that have simple variants that would be allowed by the DRFx model. For example, instead of inserting a speculative read, the compiler could insert a special prefetch instruction that the hardware would not track for purposes of conflict detection. The Itanium ISA has support for such speculation [Triebel et al. 2001] in order to hide the memory latency of reads. Also, as shown earlier in Figure 6, loop-invariant code motion is allowed by the DRFx model, as long as the hoisted reads and writes are guarded to ensure that the loop body will be executed at least once and the loop block is contained in a region. As described in the previous section, regions could be constructed to have multiple iterations of a loop within a soft-fenced region, over which the compiler is able to perform loop-invariant code motion and other sequentially valid optimizations.
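The guarded form of loop-invariant code motion just described can be illustrated with a small example of our own, in the spirit of the paper's Figure 6 (which is not reproduced here).

// Original loop, assumed to sit entirely within one soft-fenced region.
void scale(int* a, int n, const int* factor) {
    for (int i = 0; i < n; ++i)
        a[i] = a[i] * (*factor);      // *factor is loop-invariant
}

// Region-local loop-invariant code motion allowed by DRFx: the hoisted load
// of *factor is guarded by (n > 0), so it executes only when the original
// loop body would have performed it, leaving read(M, R) unchanged.
void scale_licm(int* a, int n, const int* factor) {
    if (n > 0) {
        int f = *factor;              // guarded hoisted load
        for (int i = 0; i < n; ++i)
            a[i] = a[i] * f;
    }
}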
6. DRFx-COMPLIANT HARDWARE: DESIGN AND IMPLEMENTATION

This section discusses the proposed DRFx processor architecture. A lazy conflict detection scheme using Bloom filter signatures is described, as well as several optimizations that allow efficient execution in spite of the small, bounded regions created by the DRFx compiler. We first give a brief overview of the design and then delve into more detail.
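To convey the flavor of signature-based lazy conflict detection before the detailed design, here is a simplified software sketch of our own, with illustrative sizes and hash functions rather than the hardware parameters discussed below. Each region accumulates read and write Bloom filter signatures of the addresses it touches, and two regions are reported as potentially conflicting when a write signature intersects the other region's read or write signature; because Bloom filters only produce false positives, such a scheme can over-report conflicts but never miss one.

#include <array>
#include <cstddef>
#include <cstdint>

// Simplified address signature: a fixed-size bit vector set through two hash
// functions (the 1024-bit size and the hash constants are illustrative).
struct Signature {
    std::array<std::uint64_t, 16> bits{};          // 16 x 64 = 1024 bits

    static std::uint32_t h1(std::uint64_t a) { return (a * 0x9E3779B97F4A7C15ull) >> 54; }
    static std::uint32_t h2(std::uint64_t a) { return (a * 0xC2B2AE3D27D4EB4Full) >> 54; }

    void insert(std::uint64_t addr) {
        for (std::uint32_t h : {h1(addr), h2(addr)})
            bits[h / 64] |= (1ull << (h % 64));
    }
    bool intersects(const Signature& other) const {
        for (std::size_t i = 0; i < bits.size(); ++i)
            if (bits[i] & other.bits[i]) return true;
        return false;
    }
};

struct RegionSignatures {
    Signature reads, writes;
    void record_read(std::uint64_t addr)  { reads.insert(addr); }
    void record_write(std::uint64_t addr) { writes.insert(addr); }
};

// Conservative region-conflict test: a conflict requires at least one write.
bool may_conflict(const RegionSignatures& a, const RegionSignatures& b) {
    return a.writes.intersects(b.writes) ||
           a.writes.intersects(b.reads)  ||
           a.reads.intersects(b.writes);
}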
Journal: ACM Transactions on Programming Languages and Systems (ACM Trans. Program. Lang. Syst.), Vol. 38, No. 4, Article 16, September 2016.